[GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling#12042
Hi @jinchengchenghh, I have added an optimized, sampling-based validation, which should improve performance.
… root paths

When a table has many partitions, the metadata validation checks every root path with `fileLimit` files each, resulting in excessive I/O cost. This patch introduces a sampling mechanism that selects a percentage of root paths for validation instead of checking all of them. The file limit is distributed evenly across the sampled paths.

Key changes:
- Add config `spark.gluten.sql.fallbackUnexpectedMetadataParquet.samplePercentage` with default value 0.1 (10% sampling)
- Use evenly spaced interval sampling for good partition coverage
- Add unit tests for the sampling logic
Hi @jinchengchenghh, as discussed in #11782, I have implemented the sampling module, which should improve validation performance. Could you review it when you have a chance? Thanks!
What changes are proposed in this pull request?
When a table has many partitions, the metadata validation checks every root path with `fileLimit` files each, resulting in excessive I/O cost. This patch introduces a sampling mechanism that selects a percentage of root paths for validation instead of checking all of them. The file limit is distributed evenly across the sampled paths.
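To make the idea concrete, here is a minimal, language-neutral sketch in Python of evenly spaced sampling over root paths with the file limit split across the sample. It is not the code in this PR: the function names `sample_root_paths` and `files_per_path`, the clamping to at least one path, and the floor of one file per path are all assumptions made for illustration.

```python
import math


def sample_root_paths(root_paths, sample_percentage):
    """Pick an evenly spaced subset of root_paths.

    Hypothetical helper: always keeps at least one path when the input
    is non-empty, and never more than the number of paths available.
    """
    n = len(root_paths)
    if n == 0:
        return []
    # round() guards against float noise such as 0.07 * 100 == 7.000000000000001
    wanted = math.ceil(round(n * sample_percentage, 9))
    sample_size = min(n, max(1, wanted))
    # A fractional step spreads the picks across the whole list
    # instead of clustering them at the front.
    step = n / sample_size
    return [root_paths[int(i * step)] for i in range(sample_size)]


def files_per_path(file_limit, sampled_count):
    """Split the overall file limit evenly across the sampled paths.

    Assumption: each sampled path still gets at least one file checked.
    """
    if sampled_count == 0:
        return 0
    return max(1, file_limit // sampled_count)


if __name__ == "__main__":
    paths = [f"/warehouse/t/p={i}" for i in range(100)]
    sampled = sample_root_paths(paths, 0.1)
    print(len(sampled), files_per_path(10, len(sampled)))
```

With 100 partitions and a sample percentage of 0.1, this sketch selects every tenth root path, so the validation touches 10 paths rather than 100 while still covering the full partition range.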
Key changes:
- Add config `spark.gluten.sql.fallbackUnexpectedMetadataParquet.samplePercentage` with default value 0.1 (10% sampling)
How was this patch tested?
Existing tests in ParquetEncryptionDetectionSuite continue to pass without modification.
Was this patch authored or co-authored using generative AI tooling?
ISSUE: #11782